Translation and Dictionary (翻訳と辞書)
Sentence boundary disambiguation : English Wikipedia
Sentence boundary disambiguation
Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences. However, identifying sentence boundaries is challenging because punctuation marks are often ambiguous. For example, a period may mark an abbreviation, a decimal point, an ellipsis, or part of an email address rather than the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. Likewise, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.
Languages like Japanese and Chinese have unambiguous sentence-ending markers.
==Strategies==
The standard 'vanilla' approach to locating the end of a sentence:
:(a) If it's a period, it ends a sentence.
:(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
:(c) If the next token is capitalized, then it ends a sentence.
This strategy gets about 95% of sentences correct. (Cited source: "Doing Things with Words, Part Two: Sentence Boundary Detection".) The remaining 5% often involve shortened names, e.g. "D. H. Lawrence" (with whitespace between the individual initials that form the full name), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation (or non-standard usage ''of'' punctuation).
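The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tokens are obtained by splitting on whitespace, the rules are applied in the order listed so that each later rule overrides the earlier ones, and `ABBREVIATIONS` is a tiny hypothetical stand-in for a real hand-compiled list. The names `is_boundary` and `split_sentences` are invented for this example.

```python
# Hypothetical, deliberately tiny stand-in for a hand-compiled abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "vs"}


def is_boundary(tokens, i):
    """Apply rules (a)-(c) in order to the token at index i."""
    token = tokens[i]
    if not token.endswith("."):
        return False  # only periods are considered by this heuristic
    ends = True  # (a) a period ends a sentence
    if token[:-1].lower() in ABBREVIATIONS:
        ends = False  # (b) ...unless the token is a known abbreviation
    if i + 1 < len(tokens) and tokens[i + 1][0].isupper():
        ends = True  # (c) ...but a capitalized next token ends it anyway
    return ends


def split_sentences(text):
    """Split text into sentences using whitespace tokens and the rules above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if is_boundary(tokens, i):
            sentences.append(" ".join(current))
            current = []
    if current:  # flush any trailing tokens that did not end with a boundary
        sentences.append(" ".join(current))
    return sentences
```

Under these assumptions, `split_sentences("I read D. H. Lawrence today.")` returns `["I read D.", "H.", "Lawrence today."]`, which reproduces the kind of error on initials described above, while a decimal such as "3.5" is left intact because the token does not end in a period.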
Another approach is to automatically learn a set of rules from a set of documents in which the sentence breaks are pre-marked. Some solutions are based on a maximum entropy model. (Cited source: "A Maximum Entropy Approach to Identifying Sentence Boundaries".) The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
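As a much simpler illustration of learning from pre-marked text, the sketch below derives an abbreviation list automatically instead of compiling it by hand. It is not the maximum entropy model or the SATZ network; it only assumes that the training data is a list of strings, one per pre-marked sentence, and that any period-final token occurring before the last token of a sentence is not a boundary. The function name `learn_abbreviations` and the `min_count` parameter are invented for this example.

```python
from collections import Counter


def learn_abbreviations(marked_sentences, min_count=1):
    """Collect period-final tokens that occur inside pre-marked sentences.

    marked_sentences: list of strings, each one a single sentence
    (the sentence breaks are assumed to be marked already).
    """
    internal = Counter()
    for sentence in marked_sentences:
        tokens = sentence.split()
        # Skip the last token: its period is a true sentence boundary.
        for token in tokens[:-1]:
            if token.endswith("."):
                internal[token[:-1].lower()] += 1
    # Keep only tokens seen at least min_count times in non-final position.
    return {word for word, count in internal.items() if count >= min_count}
```

A set learned this way could stand in for the hand-compiled list used by rule (b); raising `min_count` trades coverage for robustness against noise in the training data.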

Excerpt source: Wikipedia, the free encyclopedia
Copyright(C) kotoba.ne.jp 1997-2016. All Rights Reserved.